1 Exercise 01

Half-hourly electricity demand for Victoria, Australia is contained in vic_elec. Extract the January 2014 electricity demand, and aggregate this data to daily with daily total demands and maximum temperatures.

jan14_vic_elec <- vic_elec %>%
  filter(yearmonth(Time) == yearmonth("2014 Jan")) %>%
  index_by(Date = as_date(Time)) %>%
  summarise(
    Demand = sum(Demand),
    Temperature = max(Temperature)
  )
  1. Plot the data and find the regression model for Demand with temperature as a predictor variable. Why is there a positive relationship?

## Series: Demand 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -49978.2 -10218.9   -121.3  18533.2  35440.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59083.9    17424.8   3.391  0.00203 ** 
## Temperature   6154.3      601.3  10.235 3.89e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24540 on 29 degrees of freedom
## Multiple R-squared: 0.7832,  Adjusted R-squared: 0.7757
## F-statistic: 104.7 on 1 and 29 DF, p-value: 3.8897e-11

There is a positive relationship since the hotter it gets, the more electricity gets spent on cooling (air conditioners etc.).

  1. Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?

## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.93736, p-value = 0.06978

The model is adequate since the errors are uncorrelated and are mostly normally distributed (Shapiro–Wilk test gives p-value of 0.06978). There are no outliers.

Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15\(^\circ\)C and compare it with the forecast if the with maximum temperature was 35\(^\circ\)C . Do you believe these forecasts? The following R code will get you started:

plot_forecast <- function(target_temp) {
    out <- fit_lm %>%
  forecast(
    new_data(jan14_vic_elec, 1) %>%
      mutate(Temperature = target_temp)
  ) %>%
  autoplot(jan14_vic_elec) +
        geom_point(aes(color = Temperature, x = Date, y = Demand), size = 5) +
        scale_color_viridis_c() +
        scale_y_continuous(limits = c(100e3, 350e3),
                           labels = scales::comma_format(),
                           n.breaks = 10)
    
    return(out)
}

cowplot::plot_grid(
    plot_forecast(15),
    plot_forecast(35)
)

jan14_vic_elec %>% 
    as_tibble() %>% 
    arrange(Temperature)
## # A tibble: 31 × 3
##    Date        Demand Temperature
##    <date>       <dbl>       <dbl>
##  1 2014-01-06 195241.        19.6
##  2 2014-01-07 199770.        20  
##  3 2014-01-04 173798.        20.3
##  4 2014-01-25 181466.        20.3
##  5 2014-01-03 189086.        22.2
##  6 2014-01-19 187126.        22.3
##  7 2014-01-24 225570.        22.4
##  8 2014-01-12 187661.        22.5
##  9 2014-01-02 188351.        23  
## 10 2014-01-21 222492.        23.1
## # … with 21 more rows
## # ℹ Use `print(n = ...)` to see more rows

The forecast for first plot seems dodgy, since the value of Temperature = 15 was never seen in the data. Therefore, we are extrapolating. Let’s see the predicted values for a range of numbers:

I doubt that people will not use electricity when it is cold. :)

Give prediction intervals for your forecasts.

## When the temperature is 15*C, on average we can expect total electricity demand of 151.398 MWH. We can 95% sure that the demand will be between 100.179 and 202.617 MWH.
## 
## When the temperature is 35*C, on average we can expect total electricity demand of 274.484 MWH. We can 95% sure that the demand will be between 224.939 and 324.029 MWH.

Plot Demand vs Temperature for all of the available data in vic_elec aggregated to daily total demand and maximum temperature. What does this say about your model?

## Series: Demand 
## Model: TSLM 
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -62818.41 -13699.52    -25.83  17452.70 118947.95 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 220345.8     2742.3  80.352   <2e-16 ***
## Temperature    172.0      125.9   1.366    0.172    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25470 on 1094 degrees of freedom
## Multiple R-squared: 0.001702,    Adjusted R-squared: 0.0007897
## F-statistic: 1.865 on 1 and 1094 DF, p-value: 0.17229

This tells me that the relationship between Demand and Temperature is non-linear. Linear model is not good for modelling this kind of relationship. There is also lots of seasonality present …

2 Exercise 02

Data set olympic_running contains the winning times (in seconds) in each Olympic Games sprint, middle-distance and long-distance track events from 1896 to 2016.

  1. Plot the winning time against the year for each event. Describe the main features of the plot.

Gap is present in the data due to the WWII. Most series have downward trend. In some series we can see that the new winning records are actually worse than the past records.

  1. Fit a regression line to the data for each event. Obviously the winning times have been decreasing, but at what average rate per year?
## # A tibble: 14 × 8
##    Length Sex   .model            term  estimate std.error statistic  p.value
##     <int> <chr> <chr>             <chr>    <dbl>     <dbl>     <dbl>    <dbl>
##  1  10000 women TSLM(Time ~ Year) Year   -3.50     0.701      -4.99  2.47e- 3
##  2  10000 men   TSLM(Time ~ Year) Year   -2.67     0.192     -13.9   2.37e-12
##  3   5000 men   TSLM(Time ~ Year) Year   -1.03     0.104      -9.88  1.50e- 9
##  4   1500 men   TSLM(Time ~ Year) Year   -0.315    0.0418     -7.54  5.23e- 8
##  5   5000 women TSLM(Time ~ Year) Year   -0.303    1.73       -0.175 8.69e- 1
##  6    800 women TSLM(Time ~ Year) Year   -0.198    0.0384     -5.16  1.45e- 4
##  7    800 men   TSLM(Time ~ Year) Year   -0.152    0.0177     -8.58  6.47e- 9
##  8    400 men   TSLM(Time ~ Year) Year   -0.0646   0.00586   -11.0   2.75e-11
##  9    400 women TSLM(Time ~ Year) Year   -0.0401   0.0170     -2.35  3.65e- 2
## 10    200 women TSLM(Time ~ Year) Year   -0.0336   0.00568    -5.92  2.17e- 5
## 11    200 men   TSLM(Time ~ Year) Year   -0.0249   0.00181   -13.7   3.80e-13
## 12    100 women TSLM(Time ~ Year) Year   -0.0142   0.00187    -7.58  3.72e- 7
## 13    100 men   TSLM(Time ~ Year) Year   -0.0126   0.00120   -10.5   7.24e-11
## 14   1500 women TSLM(Time ~ Year) Year    0.147    0.106       1.38  1.96e- 1

We can see the biggest downward trend in the category 10.000 / women: every year, the winning time is lower by 3.50 seconds on average. Unfortunately, in one of the series, winning times are actually increasing on average: 1.500 / women.

  1. Plot the residuals against the year. What does this indicate about the suitability of the fitted lines?

Most models are quite good (judging by the \(R^2\)), while the rest of the models need some improvement. Most worrisome is 5000 / women and 1500 / women.

  1. Predict the winning time for each race in the 2020 Olympics. Give a prediction interval for your forecasts. What assumptions have you made in these calculations?

I’ve made a decision to forecast next 3 relevant years (2020, 2024, 2028). The problematic models identified earlier show neutral or positive trend, which is nonsensical.

Anyways, the assumption are:

3 Exercise 03

An elasticity coefficient is the ratio of the percentage change in the forecast variable (\(y\)) to the percentage change in the predictor variable (\(x\)). Mathematically, the elasticity is defined as \((dy / dx) \times (x/y)\). Consider the log-log model:

\[ \log{y} = \beta_0 + \beta_1 \log{x} + \epsilon \]

Express \(y\) as a function of \(x\) and show that the coefficient \(\beta_{1}\) is the elasticity coefficient.

Helpful links: link_01, link_02 and link_03.

Solution - first, we ignore \(\epsilon\) as unrelated to \(y\), and then we can apply chain rule:

\[ \frac{d \log{y}}{d \log{x}} = \frac{d \log{y}}{d y} \frac{d y}{d x} \frac{d x}{d log x} \]

We apply derivation rules to relevant functions (reminder):

\[ \frac{d \log{y}}{d y} = \frac{1}{y} \]

\[ \frac{d x}{d log x} = x \] Applying these conclusions to the first equation yields us:

\[ \frac{d \log{y}}{d \log{x}} = \frac{d \log{y}}{d y} \frac{dy}{dx} \frac{dx}{d log x} = \frac{1}{y} \frac{dy}{dx} x = \frac{dy}{dx} \frac{x}{y} \]

… which is exactly the definition of elasticity as mentioned in the exercise.

4 Exercise 04

The data set souvenirs concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.

  1. Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.

Exponential growth last three years.

  1. Explain why it is necessary to take logarithms of these data before fitting a model.

Relevant sources: here, here, here, here and here (last resource was extremely helpful).

Reasons for transformation of data is immediately apparent in the plot: the transformed, dependent variable now has linear relationship to predictor variable, so it will be easier to model. Also, logged data is no longer skewed and there is added bonus of reducing heteroscedasticity of the transformed data.

  1. Fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.

Info: surfing festival is held annually in March (source):

Date: 10th – 17th March The Noosa festival of surfing is held annually in March, in one of the world’s great surf locations. First Point at Noosa Heads is a perfect surfing spot for longboard surfing (apparently). This is due to the breezes from the South-East going around the tip of Noosa National Park.

Note: previous edition of the book had this same task, and two solution )here and here) explicitly modeled dummy variable taking into account the year from which competition started.

souvenirs_mod <- souvenirs %>% 
    mutate(SurfingFestival = month(Month) == 3 & year(Month) >= 1988)

souvenirs_fit <- souvenirs_mod %>% 
    model(TSLM(log(Sales) ~ trend() + season() + SurfingFestival))


report(souvenirs_fit)
## Series: Sales 
## Model: TSLM 
## Transformation: log(Sales) 
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.336727 -0.127571  0.002568  0.109106  0.376714 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.6196670  0.0742471 102.626  < 2e-16 ***
## trend()             0.0220198  0.0008268  26.634  < 2e-16 ***
## season()year2       0.2514168  0.0956790   2.628 0.010555 *  
## season()year3       0.2660828  0.1934044   1.376 0.173275    
## season()year4       0.3840535  0.0957075   4.013 0.000148 ***
## season()year5       0.4094870  0.0957325   4.277 5.88e-05 ***
## season()year6       0.4488283  0.0957647   4.687 1.33e-05 ***
## season()year7       0.6104545  0.0958039   6.372 1.71e-08 ***
## season()year8       0.5879644  0.0958503   6.134 4.53e-08 ***
## season()year9       0.6693299  0.0959037   6.979 1.36e-09 ***
## season()year10      0.7473919  0.0959643   7.788 4.48e-11 ***
## season()year11      1.2067479  0.0960319  12.566  < 2e-16 ***
## season()year12      1.9622412  0.0961066  20.417  < 2e-16 ***
## SurfingFestivalTRUE 0.5015151  0.1964273   2.553 0.012856 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared: 0.9567,  Adjusted R-squared: 0.9487
## F-statistic:   119 on 13 and 70 DF, p-value: < 2.22e-16
  1. Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?

Some autocorrelation is left in the residuals, but it’s nothing alarming. Residuals also seem slightly (but only sightly skewed). Are they normal?

## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.98961, p-value = 0.7435

Yes, they are. 💪

Now, to plot the residuals against the fitted values:

No obvious pattern in the residuals after accounting for trand, seasonality and March festival.

EXTRA: we can check the accuracy of the model in the following way:

  1. Do boxplots of the residuals for each month. Does this reveal any problems with the model?

The model makes the biggest residuals in November and December. Also, the residuals in March are interesting: could the sudden shift be due to the Surfing Festival? Did we account for all information?

Residuals for each month are also normally distributed.

## # A tibble: 12 × 3
##    Month pvalue variance
##    <int>  <dbl>    <dbl>
##  1     1 0.298   0.0419 
##  2     2 0.153   0.0455 
##  3     3 0.487   0.0437 
##  4     4 0.279   0.00890
##  5     5 0.979   0.0362 
##  6     6 0.560   0.00390
##  7     7 0.919   0.0151 
##  8     8 0.269   0.0405 
##  9     9 0.515   0.0457 
## 10    10 0.426   0.0464 
## 11    11 0.0847  0.0121 
## 12    12 0.0461  0.0338

At significance level of 5%, we can reject the null hypothesis that the residuals are not normally distributed in all months except December (tho we are very close).

But, there is some heteroskedasticity present, but IMO it is not cause for concern.

  1. What do the values of the coefficients tell you about each variable?
## Series: Sales 
## Model: TSLM 
## Transformation: log(Sales) 
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.336727 -0.127571  0.002568  0.109106  0.376714 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.6196670  0.0742471 102.626  < 2e-16 ***
## trend()             0.0220198  0.0008268  26.634  < 2e-16 ***
## season()year2       0.2514168  0.0956790   2.628 0.010555 *  
## season()year3       0.2660828  0.1934044   1.376 0.173275    
## season()year4       0.3840535  0.0957075   4.013 0.000148 ***
## season()year5       0.4094870  0.0957325   4.277 5.88e-05 ***
## season()year6       0.4488283  0.0957647   4.687 1.33e-05 ***
## season()year7       0.6104545  0.0958039   6.372 1.71e-08 ***
## season()year8       0.5879644  0.0958503   6.134 4.53e-08 ***
## season()year9       0.6693299  0.0959037   6.979 1.36e-09 ***
## season()year10      0.7473919  0.0959643   7.788 4.48e-11 ***
## season()year11      1.2067479  0.0960319  12.566  < 2e-16 ***
## season()year12      1.9622412  0.0961066  20.417  < 2e-16 ***
## SurfingFestivalTRUE 0.5015151  0.1964273   2.553 0.012856 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared: 0.9567,  Adjusted R-squared: 0.9487
## F-statistic:   119 on 13 and 70 DF, p-value: < 2.22e-16

The model captured positive trend that is visible in the data. For example, if we are forecasting for March of any year, we can on average expect that log(Sales) will be bigger by 0.5015151. Regarding season()year_INT values, let’s interpret season()year2 coefficient: we are expecting (on average) that the log of future sales will be 25% bigger compared to the log of sales of 2 years prior.

  1. What does the Ljung-Box test tell you about your model?

Important resource: for interpretation of the test, especially regarding p-values. Hyndman also has great article on the topic of appropriate lags.

# based on Hyndman's article, linked above
TARGET_LAG <- min(nrow(souvenirs_mod) / 5, 2 * 12) %>% round(0)

augment(souvenirs_fit) %>% 
    features(.innov, ljung_box, lag = TARGET_LAG) 
## # A tibble: 1 × 3
##   .model                                                  lb_stat lb_pvalue
##   <chr>                                                     <dbl>     <dbl>
## 1 TSLM(log(Sales) ~ trend() + season() + SurfingFestival)    81.4  2.13e-10

The p-value is very close to zero, and we can reject the \(H_{0}\) that the data is independently distributed. In other words, residuals exhibit autocorrelation at the specified lag. This means that there might be some problems with the forecasts because of the autocorrelation or heteroskedasticity in our residuals.

  1. Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.

  1. How could you improve these predictions by modifying the model?

We could probably add even more holiday seasons and/or encode month name/number as category.

5 Exercise 05

The us_gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are in “million barrels per day”. Consider only the data to the end of 2004.

  1. Fit a harmonic regression with trend to the data. Experiment with changing the number Fourier terms. Plot the observed gasoline and fitted values and comment on what you see.

I see that the time series are very well approximated by Fourier Series. The correct value of K should be found by testing AIC vs. the K parameter.

  1. Select the appropriate number of Fourier terms to include by minimising the AICc or CV value.
## # A tibble: 25 × 2
##       AIC     K
##     <dbl> <dbl>
##  1 -1888.     7
##  2 -1887.    12
##  3 -1887.    11
##  4 -1886.     8
##  5 -1884.     6
##  6 -1884.    13
##  7 -1883.     9
##  8 -1883.    10
##  9 -1882.    14
## 10 -1879.    15
## # … with 15 more rows
## # ℹ Use `print(n = ...)` to see more rows

  1. Plot the residuals of the final model using the gg_tsresiduals() function and comment on these. Use a Ljung-Box test to check for residual autocorrelation.

Constant variance / mean of residuals, residuals are uncorrelated and normally distributed. Awesome!

  1. Generate forecasts for the next year of data and plot these along with the actual data for 2005. Comment on the forecasts.

Reasonable forecast.

6 Exercise 06

The annual population of Afghanistan is available in the global_economy data set.

  1. Plot the data and comment on its features. Can you observe the effect of the Soviet-Afghan war?

Graph speaks for itself.

  1. Fit a linear trend model and compare this to a piecewise linear trend model with knots at 1980 and 1989.

Well, it’s obvious that the second model better fits the training data.

  1. Generate forecasts from these two models for the five years after the end of the data, and comment on the results.

First model forecasts downward trajectory of the population because it takes into account the Soviet-Afghan war in it’s calculations. On the other hand, second model generates more sensible forecasts since the war was one-time event (which was taken care of with knots in the model).

7 Exercise 07

I could probably do it, but I want to go on with the book, so… 😊